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27 Abstract 

28 A two-step ensemble recentering Kalman filter (ERKF) analysis scheme is introduced. The 

29 algorithm consists of a recentering step followed by an ensemble Kalman filter (EnKF) analysis 

30 step. The recentering step is formulated such as to adjust the prior distribution of an ensemble of 

31 model states so that the deviations of individual samples from the sample mean are unchanged 

32 but the original sample mean is shifted to the prior position of the most likely particle, where the 

33 likelihood of each particle is measured in terms of closeness to a chosen subset of the 

34 observations. The computational cost of the ERKF is essentially the same as that of a same size 

35 EnKF. 

36 

37 The ERKF is applied to the assimilation of Argo temperature profiles into the OGCM component 

38 of an ensemble of NASA GEOS-5 coupled models. Unassimilated Argo salt data are used for 

39 validation. A surprisingly small number (16) of model trajectories is sufficient to significantly 

40 improve model estimates of salinity over estimates from an ensemble run without assimilation. 

41 The two-step algorithm also performs better than the EnKF although its performance is degraded 

42 in poorly observed regions. 

43 

44 Keywords: 

45 Data assimilation; Kalman filter; ensemble Kalman filter; Particle filter; 

46 Coupled data assimilation 
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47 1. Introduction 

48 Since its introduction by Evensen [1994, 1996] in the context of a quasigeostrophic ocean model, 

49 the ensemble Kalman filter (EnKF) has gained wide acceptance among the atmosphere and 

50 ocean modeling communities as a viable data assimilation technique. In an EnKF, the 

51 prohibitive cost associated with evolving the model-background error-covariance matrix 

52 according to the traditional Kalman filter [ Kalman , 1960] formulation is avoided. Rather, the 

53 background-error covariance propagation is replaced with the concurrent integration of an 

54 ensemble of model trajectories. The forecast-error covariance statistics needed to compute the 

55 Kalman gain are then estimated from the statistical distribution of the ensemble of model states. 

56 The EnKF uses the ensemble mean to estimate the true state of the dynamical system. The 

57 posterior ensemble variance serves to estimate the analysis error variance. 

58 

59 The sample mean is the most likely forecast if the ensemble is normally distributed. A nonnal 

60 distribution is often assumed in climate system data analysis, either for convenience or for lack 

61 of a better assumption. However, it is known from nonlinear dynamical system theory and 

62 confirmed from observations that clustering around several likely forecast solutions can occur. 

63 For example, the Kuroshio is in a multiple equilibrium state in which meandering and straight 

64 paths are equally plausible but the intennediate mean path is unstable. Because of this, the 

65 central forecast (defined as the ensemble member closest to the mean in terms of RMS 

66 difference) is sometimes used to estimate the true state. 

67 

68 Another problem associated with using the sample mean to estimate the true state is that while 

69 the individual ensemble members are dynamically balanced states, their mean generally is not. 

70 In a multivariate EnKF in which certain prognostic model variables are updated as observations 

71 of other variables are assimilated, the resultant analysis increments can lead to one or more 

72 ensemble members becoming dynamically unstable when the analyzed ensemble is advanced 

73 further in time. 

74 

75 Particle filters are a class of estimation methods that do not estimate the true state with the 

76 sample mean. Instead, the true-state estimates obtained with particle filters are statistically 

77 plausible, dynamically balanced states. The original particle filter algorithm, sample importance 

78 resampling [SIR: Gordon et al., 1993], approximates the true state with a weighted sum of 

79 particles (ensemble members) where the weights are proportional to the respective probability of 

80 each particle. In nonlinear systems, SIR can estimate the true system state more accurately than 

81 the EnKF when the number of particles is large enough. Nevertheless, for the small ensemble 

82 sizes typically used in GCM ensemble prediction, i.e. a few tens to a few hundred ensemble 

83 members, the EnKF is generally more accurate [ Weerts and El Serajv, 2006]. Besides, the 

84 weighted-particle SIR estimate is just as likely to be unstable as the EnKF ensemble mean. If, 

85 however, the most likely particle is used to estimate the state, the estimate will generally be 

86 stable and balanced. Note however that an even larger number of particles will be required for 

87 such as scheme to be accurate. 

88 

89 The purpose of this note is to introduce a two-step ensemble recentering Kalman filter (ERKF) 

90 analysis procedure that combines advantages from the EnKF and SIR schemes. It is shown that 

91 the ERKF can produce a more accurate state estimate than a same-size EnKF with dramatically 

92 less particles than a typical SIR filter would require. Our demonstration uses the NASA Global 
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Modeling and Assimilation Office (GMAO) global earth observing system (GEOS) integrated 
ocean data assimilation system (iODAS) applied to the assimilation of Argo temperature profiles 
into the OGCM component of the GEOS-5 coupled system. The ERKF is introduced in Section 
2. The next two Sections contain overviews of the CGCM and of the data assimilation system. 
The validation experiments are discussed in Section 5. Our conclusions follow in Section 6. 

2. The Two-Step Analysis Procedure 

Let us consider a nonlinear model, M, that evolves an approximation, x, of the true system state, 
x f , subject to a forcing term,/ We define an observation operator, H, that maps the true system 
state to an observation vector, y. Then, the system that defines the estimated state evolution can 
be written: 


— = M( x, f ) + q, (la) 

at 

y = H( x t ) + r, (lb) 

E((x-x t )(x-x t ) T ) = P, E(qq‘ ) = Q, E(rr T ) = R, (lc) 


where E is the usual expectation operator. 

We want to optimally estimate x t given information about x, y and their error distributions. The 
Kalman filter [Kalman, 1960] tells us that the estimate that minimizes the analysis error variance 
is 


x a =x f + PH T {HPH T + - Hx f ), (2) 

where xf is the prior state estimate (model forecast). 

In an EnKF, P is estimated from the statistical distribution of an ensemble of model forecasts, 
x i ,i = l each evolved according to ( 1 a), so that 

S = {xj - x, ■ ■ ■ , x n - *}, (3a) 

SS T , (3b) 

n - 1 


where the overbar denotes 
xf in (2). 

The first (recentering) step 


the sample mean. The analysis for Xj follows 


of the ERKF works by finding the most likely 


a. 


-,=L 


'y k - H kX P \ 


( 4 ) 


from substituting x-, for 
sample, x p , such that 
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is minimized, where the k subscript runs over a chosen subset of the latest batch of observations, 
and ru and Hk are the measurement error and observation operator for the Ath datum. A pre- 
analysis increment, A = x - x, is then applied to each ensemble member. This places the 

ensemble mean at the original “location” of x p without affecting our estimates of P and R since 
the sample distribution relative to its mean is not changed (Figure 1). 

The second step uses (2) and (3) to compute the usual EnKF analysis from the ensemble of 
x\ = x [ +A p ,i = \,...,n, thus producing a set of analysis increments, 

4 = PH' (hPH t + rY ( y - Hx + r ), i = l,...,n, (5) 

so that the total increment applied to ensemble member i is A p +A ; . 

Note that the observation vector used to identify x p could be different from that used in the 
Kalman filter update (5). 

3. The GEOS- 5 Coupled Model 

The GEOS-5 CGCM couples the NASA GEOS-5 atmospheric model [Rienecker et al, 2008] 
and GFDL’s Modular Ocean Model version 4 (MOM4) and the Los Alamos CICE sea ice model 
[ Hunke et al., 2013] using the Earth System Modeling Framework [ESMF: Hill et al., 2006]. 
Ocean- atmosphere coupling is done through a physical interface layer of constant depth (2m). 
Atmospheric fluxes of heat and fresh water are applied at the top of this layer and oceanic 
turbulent fluxes are parameterized at the bottom of the layer. The mass of the layer and amount 
of mixing at its base are chosen so that the layer simulates the diurnal cycle of SST. 

The GEOS-5 configuration used here runs MOM4 and CICE on a tripolar grid with uniform 1/2° 
zonal resolution, variable meridional resolution ranging from 1/6° in the Tropics to 1/2° in the 
high latitudes, and has 40 vertical levels. The dimension of the grid is 720x410x40 
(zonalxmeridionalxvertical). The AGCM has uniform 5/4° zonal resolution, 1° meridional 
resolution and 72 vertical levels (288x181x72 grid dimension). The system is run with a 15- 
minute time step. A faster, 15-second barotropic time step is also used to run MOM4. 

The CGCM system has close to 100-million prognostic state variables in the configuration used 
herein. The ocean data assimilation modifies oceanic fields of temperature (T), salinity (S), 
zonal current (El), meridional current (V) and sea surface height (SSH) as well as sea ice 
concentration and thic kn ess. 

4. GEOSiODAS 

The GEOS integrated ocean data assimilation system (iODAS) system has evolved from the first 
generation GMAO ocean data assimilation system introduced originally in Keppenne and 
Rienecker [2003] and discussed in more detail in Keppenne et al. [2005, 2008]. iODAS is 
implemented as an ESMF [Hill et al, 2004] gridded component. The communications between 
the ocean and sea ice models and iODAS are managed by ESMF. The system is used routinely 
to produce the GMAO production ocean analysis [ Vernieres et al., 2012]. What follows is a 
summary of the main differences between the analysis algorithms used here and in Keppenne et 
al. [2008]. 
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As is customary in EnKF applications to large-scale atmospheric or ocean models, the 
background-error covariances are localized to address the degree-of-freedom limitations 
encountered when the sample size is much less than that of the model state vector. The error- 
covariance localization is flow adaptive (following neutral density surfaces) and an iterative 
procedure is used to individually optimize the covariance localization scales involved in the 
processing of each observation. Incremental analysis updating [IAU: Bloom et al, 1996] is used 
to apply them gradually over the assimilation interval. 

The ocean assimilation is applied to the full CGCM, with the atmospheric component 
constrained during the integration by “replaying” the NASA Modem Era Retrospective-analysis 
for Research and Applications [MERRA: e.g., Rienecker et al., 2008] into the GEOS-5 AGCM. 
The procedure involves integrating the AGCM to the next synoptic analysis time, reading the 
MERRA analysis fields and calculating analysis increments by taking their difference from the 
background atmospheric fields, rewinding the AGCM and, finally, integrating the full CGCM 
while incrementally applying both the atmospheric analysis increments thus computed and the 
ocean analysis increments produced by iODAS. The implementation is designed to facilitate 
consistent atmosphere-ocean states to initialize GEOS-5 seasonal climate forecasts. More details 
about the replay procedure are available in Vernieres et al. [2012]. 

5. Experiments and Results 

To test the ERKF, four experiments were ran spanning March- June 2006 and using 16 model 
trajectories. The initial condition comes from integrating a single instance of the CGCM while 
constraining its AGCM component by replaying the MERRA reanalysis. From this initial 
condition, the ensemble is first spun up during March 2006 by perturbing the analysis increments 
of the atmospheric analysis replay procedure and also applying daily perturbations to the OGCM 
T and salinity S fields. The ensemble configuration as of March 31 2006 is then used to 
initialize each ensemble ran. 

CE-16 is a control ensemble run without assimilation. ER-16 only applies the ensemble 
recentering step, EnKF- 16 only applies the EnKF analysis step and ERKF- 16 applies both the 
recentering and EnKF analysis steps. Starting on April 1, 2006, Argo T profiles are assimilated 
daily in EnKF- 16 and in ERKF- 16. These same Argo T profiles are also used to select x p in the 
recentering steps of ER-16 and ERKF-16. The average wallclock run times per simulation day 
on 960 2.8 GFIz Intel Altix Sandy Bridge cores are 687 seconds in CE-16, 691 seconds in ER-16, 
744 seconds in EnKF- 16 and 749 seconds in ERKF-16, illustrating the minimal cost of the 
recentering step. The perfonnance of the data assimilation and of the ensemble recentering is 
assessed from how closely each run can reproduce the assimilated Argo T profiles and the 
unassimilated Argo S profiles. 

The time mean April- June 2006 global RMS OMF differences of each run with the Argo T and S 
data are listed in table 1 . They show that applying the recentering step without following it with 
an EnKF analysis has only a small impact on the RMS T OMF (7% reduction from CE-16 to ER- 
16) and that there is no noticeable benefit for T in applying the recentering step on top of the 
EnKF analysis step (1% RMS T OMF increase from EnKF- 16 to ERKF-16). However, the 
results are markedly different in terms of the unassimilated S variable. The RMS S OMF from 
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224 ER-16 are 24% lower than those from CE-16 and those of ERKF-16 are 24% lower than those 

225 from EnKF-16. In contrast with ER-16 and ERKF-16, while EnKF-16 performs best for T, it 

226 does not significantly improve over CE-16 for S. 

227 



CE-16 

ER-16 

EnKF-16 

ERKF-16 

RMS T OMF (°C) 

1.62 

1.51 

0.91 

0.92 

RMS S OMF (PSU) 

0.51 

0.39 

0.49 

0.37 


228 Table 1 . April-June 2006 global mean RMS OMF differences with the assimilated Argo T data 

229 and unassimilated Argo S data in each of the CE-16, ER-16, EnKF-16 and ERKF-16 runs. 

230 

231 Figures 2 and 3 expand upon the information provided by Table 1. Figure 2 shows the vertical 

232 average of the RMS OMF for the assimilated Argo T data averaged over the 3 -month data 

233 assimilation period and binned to 3°x3° horizontal boxes. Figure 3 shows the corresponding 

234 binned RMS OMF horizontal distributions for the unassimilated Argo S data. For the T data, 

235 Figures 2c (EnKF-16) and 2d (ERKF-16) show very similar distributions, while the distribution 

236 of RMS T OMF in ER-16 (Figure 2b) resembles closely that of CE-16 (Figure 2a). For the 

237 unassimilated S data, there are strong similarities between the distributions of RMS OMF of CE- 

238 16 (Figure 3a) and EnKF-16 (Figure 3c) and of ER-16 (Figure 3b) and ERKF-16 (Figure 3d). 

239 

240 To better understand the effect of the recentering step on the unassimilated S variable, Figures 4 

241 and 5 show the differences of the RMS S OMF of CE-16 from those of EnKF-16 and ER-16. 

242 Figure 4 corresponds to the upper 200 meters. Figure 5 is for the 200-2000 meter depth range. 

243 Warm (cold) colors indicate that the analysis is closer to (further away from) the unassimilated 

244 Argo S data than the control. Clearly, the better performance in terms of RMS S OMF of the 

245 recentering step from ER-16 over the EnKF analysis from EnKF-16 is mostly due the 

246 recentering’s effectiveness in the upper 200 meters, where the RMS S OMF of EnKF-16 are not 

247 significantly different from those of CE-16 (Figure 4). Both the recentering and the EnKF 

248 analysis are effective below 200 meters, but while ER-16 is generally closer than both EnKF-16 

249 and CE-16 to Argo S in the tropics where there are more Argo profiles than in the extratropics, it 

250 is less effective south of 45°S where the observations are sparse. However, when the recentering 

251 is applied first and then followed by the EnKF analysis, as is the case in ERKF-16, the poor 

252 performance of the recentering in the data-sparse high southern latitudes is compensated by the 

253 EnKF analysis ability to use its background error covariance estimates to propagate information 

254 from data rich regions to data poor regions. 

255 

256 6. Conclusions 

257 The ensemble recentering Kalman filter (ERKF) first applies a recentering step to adjust the 

258 distribution of an ensemble of model states so that the ensemble mean is shifted to the prior 

259 position of the most likely sample without altering the ensemble statistics, after which an EnKF 

260 analysis step is applied. In our experiments, the ERKF is better able than the EnKF to improve 

261 model estimates of the unassimilated S variable when Argo T profiles are assimilated into the 

262 GEOS-5 CGCM. The effectiveness of the recentering step is attributed to its ability to gradually 

263 correct model biases over time and to the fact that it replaces the ensemble mean with a balanced 

264 state. Although the recentering is less effective in poorly observed regions, the EnKF analysis 

265 compensates for this fact by propagating infonnation from data rich areas to data poor areas 

266 using its background error covariance estimates. In contrast to particle filters, which must 
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integrate a very large number of model trajectories to be competitive with the EnKF, the ERKF 
can improve upon the performance of the EnKF without requiring a larger ensemble. 
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Figure Captions 


Figure 1. Schematic representation of the recentering step of the ERKF showing the sample 
mean, x (doubly outlined small open circle), the most likely sample x p (small open circle), the 

control observations, y (large open circle) and the recentering increment, A p (arrow connecting 

x and x p ). The filled circles represent particles other than x p . The cluster of circles shown to 

the left represents the prior sample forecast. The cluster to the right represents the recentered 
sample. 

Figure 2. RMS OMF differences with respect to the assimilated Argo T data averaged vertically 
and binned to 3°x3° boxes over April-June 2006 in (a) CE-16, (b) ER-16 (c) EnKF-16 and (d) 
ERKF- 16. 

Figure 3. Same as Figure 2 for the unassimilated Argo S data. 

Figure 4. Difference of CE-16 RMS OMF for Argo S from those of (a) EnKF-16 and (b) 
ERKF- 16 averaged over the upper 200 meters for the unassimilated Argo S data. Warm (cold) 
colors indicate areas where the analysis is closer to (further from) the observations than CE-16. 

Figure 5. Same as Figure 4 for the 200-2000 meter depth range. 
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